The pycoQC repository contains 6 example sequencing summary files generated with various versions of Albacore. Each of those files contains only 10,000 reads.
Larger versions of these files are also available from https://www.ebi.ac.uk/~aleg/data/pycoQC_test/
pycoQC is a simple class that is initialized with a text summary file generated by ONT Albacore. For a 1D run, use the file named sequencing_summary.txt available at the root of the Albacore output directory. For 1D2, use sequencing_1dsq_summary.txt, which can be found in the 1dsq_analysis directory.
The instantiated object can subsequently be called with various methods that generate tables and plots.
There are a few different ways to get help for all the public package functions:
- ?pycoQC.channels_activity
- help (pycoQC.channels_activity)
- Shift + Tab

Import pycoQC main class as well as Plotly and enable inline plotting in the current Notebook.
This is the recommended option. It ensures that all your data are stored inside the notebook.
The limitation is that when many plots are generated from large datasets, the notebook becomes quite heavy and slow.
# Run cell with Ctrl + Enter
from pycoQC.pycoQC import pycoQC
from plotly.offline import plot, iplot, init_notebook_mode
init_notebook_mode(connected=False)
This option takes advantage of the Plotly web-service for hosting graphs. It requires setting up an account (https://plot.ly/python/getting-started/#initialization-for-online-plotting) and providing credentials in the notebook. This could be a good option for easy sharing of the interactive plots generated by pycoQC.
# Only run this cell if you have already set up a plotly account and want to use the Plotly web-service
# from plotly.plotly import plot, iplot
# import plotly.tools as pt
# pt.set_credentials_file (username="XXXXXXXXXX", api_key="XXXXXXXXXX")
Upon initialization pycoQC reads the sequencing summary file, runs a series of tests and pre-processes the data for the plotting methods.
pycoQC can read compressed sequencing_summary.txt files (gzip, bz2, zip, xz). Instead of a single file, it is also possible to pass a UNIX style regex (a shell-style wildcard pattern such as ./data/*RNA*) to match multiple files.
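To illustrate how a wildcard pattern and compressed inputs could be handled, the sketch below expands a pattern with the standard glob module and picks an opener from the file extension. The helpers expand_pattern, open_summary and count_reads are hypothetical names used for illustration only. They are not part of pycoQC, whose loading code may work differently. The zip format is left out because it would require selecting a member inside the archive.

```python
import bz2
import glob
import gzip
import lzma

# Hypothetical helpers for illustration only; not part of the pycoQC API.
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}

def expand_pattern(pattern):
    """Return the sorted list of files matching a shell-style wildcard pattern."""
    return sorted(glob.glob(pattern))

def open_summary(path):
    """Open a plain or compressed text file, choosing the opener from the extension."""
    for ext, opener in _OPENERS.items():
        if path.endswith(ext):
            return opener(path, "rt")
    return open(path, "rt")

def count_reads(pattern):
    """Count data lines (header excluded) across all files matching the pattern."""
    total = 0
    for path in expand_pattern(pattern):
        with open_summary(path) as handle:
            next(handle, None)  # skip the header line
            total += sum(1 for _ in handle)
    return total
```

With the example files of the repository, count_reads("./data/*RNA*") would add up the reads of every RNA summary file, whether compressed or not.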
Depending on the run type and the version of Albacore used, some information might not be available. In particular, calibration reads were not flagged in early versions of Albacore. When the field is available, those reads are automatically discarded. Similarly, barcode information is only available in multiplexed runs.
The type of run (1D or 1D2) is automatically detected but can be explicitly enforced with run_type if needed.
There are often several run IDs in a single sequencing_summary file. Unfortunately, there is no way to know the correct order based on the information contained in the sequencing_summary.txt file alone. By default pycoQC reorders the runs by decreasing throughput, which should normally reflect the sequencing order. However, if you know the order you can specify it at initialisation with the option runid_list. This option can also be used to select specific run IDs.
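The ordering by decreasing throughput can be pictured with a small standalone sketch. It assumes that each read carries a run ID, a start time in seconds and a number of bases, and that throughput means bases per second over the span of the run. Both the tuple layout and this definition are assumptions made for the example, not a description of pycoQC internals.

```python
from collections import defaultdict

def order_runids_by_throughput(reads):
    """Order run IDs by decreasing throughput (bases per second).

    `reads` is an iterable of (run_id, start_time_s, num_bases) tuples.
    """
    bases = defaultdict(int)
    first = {}
    last = {}
    for run_id, start_time, num_bases in reads:
        bases[run_id] += num_bases
        first[run_id] = min(first.get(run_id, start_time), start_time)
        last[run_id] = max(last.get(run_id, start_time), start_time)

    def throughput(run_id):
        # Guard against a zero-length run (single read or identical start times)
        duration = max(last[run_id] - first[run_id], 1)
        return bases[run_id] / duration

    return sorted(bases, key=throughput, reverse=True)

reads = [
    ("run_b", 0, 500), ("run_b", 100, 500),    # 1000 bases over 100 s
    ("run_a", 0, 4000), ("run_a", 100, 4000),  # 8000 bases over 100 s
]
print(order_runids_by_throughput(reads))  # ['run_a', 'run_b']
```

A run restarted on a worn flowcell usually yields less per second than the first run, which is why this heuristic tends to recover the sequencing order.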
By default pycoQC assumes that the minimal mean quality for a "pass" read is 7 (same as default Albacore value). However if you want to adjust the value, you can specify it at initialisation with min_pass_qual.
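As a minimal sketch of what the threshold means, the function below splits mean quality scores into "pass" and "fail" groups. The name split_pass_fail is hypothetical, and treating a score exactly equal to the threshold as a pass is an assumption.

```python
def split_pass_fail(mean_quals, min_pass_qual=7):
    """Split mean read quality scores into pass and fail lists."""
    # Assumption: a read whose mean quality equals the threshold counts as a pass
    passed = [q for q in mean_quals if q >= min_pass_qual]
    failed = [q for q in mean_quals if q < min_pass_qual]
    return passed, failed
```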
help (pycoQC.__init__)
# Run cell with Ctrl + Enter
p = pycoQC("./data/Albacore-1.7.0_basecall-1D-DNA_small_sequencing_summary.txt.gz")
print (p)
p = pycoQC("./data/*RNA*", verbose_level=3)
Plots are generated with plotly for Python. Each plotting method returns a plotly Figure object that can be passed to:
- iplot (either from plotly.plotly or plotly.offline)
- plot (either from plotly.plotly or plotly.offline)

In this notebook we will use the inline plotting option with the offline plotly library.
Users can also customize the figures online in a user friendly environment by clicking on "Edit in Chart Studio" in the upper right corner of each figure.

Similarly static pictures can be exported using the "Download plot as a png" button.

All the methods have the arguments width and height that can be used to customize the plotting area. In general we do not recommend modifying these values as they might disrupt the plot layout.
Most of the methods also have the argument sample. By default pycoQC downsamples the number of reads to 100,000 before plotting. This drastically reduces the processing time for large datasets and has a very limited impact on the plot aspect. The sampling is random but deterministic, meaning that you should always obtain the same results for the same dataset. The value can be changed to increase or decrease the number of reads. Alternatively, one can deactivate the behavior by specifying sample=False.
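"Random but deterministic" sampling is typically obtained by seeding the random number generator with a fixed value. The sketch below shows the idea with the standard library only. The function name downsample and the seed value are assumptions for illustration, and pycoQC may implement the sampling differently.

```python
import random

def downsample(reads, sample=100_000, seed=42):
    """Return at most `sample` reads, chosen reproducibly; sample=False disables it."""
    if sample is False or sample is None or len(reads) <= sample:
        return list(reads)
    rng = random.Random(seed)  # fixed seed -> same subset on every call
    return rng.sample(reads, sample)
```

Calling downsample twice on the same list returns the same subset, so plots built from it do not change between executions of the notebook.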
The summary method generates a simple summary table with a clickable button to switch from "all reads" to "pass reads" only.
help(pycoQC.summary)
# Run cell with Ctrl + Enter
p = pycoQC("./data/*RNA_small_sequencing_summary.txt.gz")
fig = p.summary()
iplot(fig, show_link=False)
pycoQC has 3 methods to visualize the distribution of mean quality scores and of estimated read length:
- reads_len_1D: A histogram of estimated read length in logarithmic scale
- reads_qual_1D: A histogram of mean quality scores
- reads_len_qual_2D: A density contour plot of estimated read length vs mean quality scores in semilog scale

Although we recommend sticking to the default values, all 3 methods allow users to customize the plots:
- nbins for the 1D plots and len_nbins / qual_nbins for the 2D plot
- color/colorscale

help(pycoQC.reads_len_1D)
# Run cell with Ctrl + Enter
p = pycoQC("./data/Albacore-2.1.10_basecall-1D-RNA_small_sequencing_summary.txt.gz")
fig = p.reads_len_1D()
iplot(fig, show_link=False)
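To make the effect of nbins on a logarithmic axis more concrete, here is a standalone sketch that counts read lengths in bins of equal width on a log10 scale. The function log_histogram is hypothetical and only illustrates the binning principle. It expects at least one positive length and is not the code used by reads_len_1D.

```python
import math

def log_histogram(lengths, nbins=5):
    """Count read lengths in `nbins` bins of equal width on a log10 scale."""
    logs = [math.log10(n) for n in lengths if n > 0]
    lo, hi = min(logs), max(logs)
    width = (hi - lo) / nbins or 1.0  # avoid a zero width when all lengths are equal
    counts = [0] * nbins
    for value in logs:
        # The maximum value falls on the last edge, so clamp it into the last bin
        index = min(int((value - lo) / width), nbins - 1)
        counts[index] += 1
    edges = [10 ** (lo + i * width) for i in range(nbins + 1)]
    return edges, counts
```

Because the bin edges grow geometrically, short and very long reads are both resolved without most bins ending up empty, which is the reason for plotting read length on a log scale.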
help(pycoQC.reads_qual_1D)
# Run cell with Ctrl + Enter
p = pycoQC("./data/Albacore-2.1.10_basecall-1D-RNA_small_sequencing_summary.txt.gz")
fig = p.reads_qual_1D()
iplot(fig, show_link=False)
help(pycoQC.reads_len_qual_2D)
# Run cell with Ctrl + Enter
p = pycoQC("./data/*RNA*")
fig = p.reads_len_qual_2D()
iplot(fig, show_link=False)
pycoQC can generate plots showing the evolution of the sequencing output (output_over_time) as well as the mean read quality (qual_over_time) over the course of the sequencing run.
Please be aware that if there are multiple run IDs in the source file(s), pycoQC reorders the run IDs by decreasing throughput per second, as explained in Initialisation. This means that the over_time plots could be wrong, particularly when mixing several runs together.
For both functions the argument smooth_sigma can be used to modulate the smoothing factor of the gaussian filter, if you are not satisfied with the default result.
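The role of a sigma parameter in Gaussian smoothing can be sketched in pure Python as follows. This is a generic illustration, not pycoQC's filter: the function name gaussian_smooth, the truncation at about 3 sigma and the renormalisation at the edges of the series are all assumptions.

```python
import math

def gaussian_smooth(values, sigma=1.0):
    """Smooth a 1D series with a truncated, normalised Gaussian kernel."""
    if sigma <= 0:
        return list(values)
    radius = max(1, int(3 * sigma))  # ignore weights beyond ~3 sigma
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    smoothed = []
    for i in range(len(values)):
        total = weight_sum = 0.0
        for k, weight in zip(offsets, kernel):
            j = i + k
            if 0 <= j < len(values):  # drop kernel weights that fall outside the series
                total += weight * values[j]
                weight_sum += weight
        smoothed.append(total / weight_sum)
    return smoothed
```

A larger sigma spreads each point over more neighbours, so short spikes in output or quality are flattened more strongly, while a small sigma keeps the curve close to the raw values.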
The colors of both plots can be fully customised:
- cumulative_color and interval_color for output_over_time
- median_color, quartile_color and extreme_color for qual_over_time

help(pycoQC.output_over_time)
# Run cell with Ctrl + Enter
p = pycoQC("./data/Albacore-1.2.1_basecall-1D-DNA_small_sequencing_summary.txt.gz")
fig = p.output_over_time()
iplot(fig, show_link=False)
help(pycoQC.qual_over_time)
# Run cell with Ctrl + Enter
p = pycoQC("./data/Albacore-2.1.10_basecall-1D-DNA_small_sequencing_summary.txt.gz")
fig = p.qual_over_time()
iplot(fig, show_link=False)
When barcoding information is available, it is possible to generate a pie chart of the barcode count distribution. If no barcode information is available, pycoQC raises an error.
It is not rare to have non-relevant barcodes detected at very low levels. By default any barcode below 0.1% of the reads is excluded from the plot, but this can be changed with min_percent_barcode.
As with the previously described methods, colors are customisable with colors.
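The effect of the percentage threshold can be sketched with a plain dictionary of barcode counts. The function filter_barcodes is a hypothetical helper, and computing the percentage against the total of all listed barcodes is an assumption about how the cutoff is applied.

```python
def filter_barcodes(counts, min_percent_barcode=0.1):
    """Keep barcodes representing at least `min_percent_barcode` percent of all reads."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        barcode: n
        for barcode, n in counts.items()
        if 100 * n / total >= min_percent_barcode
    }

counts = {"barcode01": 6000, "barcode02": 3995, "barcode07": 5}
print(filter_barcodes(counts))  # barcode07 (0.05% of reads) is dropped
```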
help(pycoQC.barcode_counts)
# Run cell with Ctrl + Enter
p = pycoQC("./data/Albacore-1.2.3_basecall-1D-RNA_small_sequencing_summary.txt.gz")
fig = p.barcode_counts()
iplot(fig, show_link=False)
Although the flowcell layout can be visually attractive (see https://github.com/mattloose/flowcellvis), it is not very informative about how the channels generate data during the run.
The channels_activity method generates a heatmap style plot showing the output over time per channel.
The number of channels can be changed to match MinION flowcells (512, the default) or PromethION flowcells (3000).
The argument smooth_sigma can be used to modulate the smoothing factor of the Gaussian smoothing filter.
Colors can be changed with colorscale.
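The data behind such a heatmap can be thought of as a matrix with one row per time interval and one column per channel. The sketch below builds that matrix from (channel, start time, number of bases) tuples. The function name channel_activity, its parameters (including n_channels) and the 10 minute interval are hypothetical choices for this example and do not describe the signature of the pycoQC method.

```python
def channel_activity(reads, n_channels=512, run_duration_s=3600, bin_s=600):
    """Sum bases per channel and time bin.

    `reads` is an iterable of (channel, start_time_s, num_bases) tuples with
    1-based channel numbers. Returns a list of rows (one per time bin), each
    holding one value per channel.
    """
    n_bins = -(-run_duration_s // bin_s)  # ceiling division
    matrix = [[0] * n_channels for _ in range(n_bins)]
    for channel, start_time, num_bases in reads:
        # Reads starting at or after the nominal end go into the last bin
        time_bin = min(int(start_time // bin_s), n_bins - 1)
        matrix[time_bin][channel - 1] += num_bases
    return matrix
```

Channels that stop producing data show up as columns that fall to zero after a given row, which is the pattern the heatmap is meant to reveal.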
help(pycoQC.channels_activity)
# Run cell with Ctrl + Enter
p = pycoQC("./data/Albacore-1.2.1_basecall-1D-DNA_small_sequencing_summary.txt.gz")
fig = p.channels_activity(width=1600)
iplot(fig, show_link=False)